Modeling and prediction
of Nitrogen Dioxide (\(\small{NO_2}\)) levels in urban areas

Final project for the course of
Statistical Methods of Data Science

Submitted by: Francesco Russo, 1449025
Submitted to: Prof. Luca Tardella

Introduction

The air of modern cities contains several different pollutants. Many of them are the direct product of combustion processes in industrial facilities and in vehicles powered by fossil fuels, and are therefore known as primary pollutants.

These include Carbon Monoxide \(CO\), Carbon Dioxide \(CO_2\), Nitric Oxide \(NO\), and the microscopic solid particles known as \(PM2.5\) and \(PM10\).

Other substances, like Ozone \(O_3\) and Nitric Acid \(HNO_3\), are instead the products of chemical reactions among the primary pollutants, and are therefore known as secondary pollutants.

Lastly, in the case of Nitrogen Dioxide \(NO_2\), only a small percentage (less than \(5 \%\)) is the direct product of combustion processes, i.e. linked to primary pollution, while most of it is obtained as a product of the chemical reactions involving \(NO\) and \(O_3\), i.e. through secondary pollution. Indeed we have that:

\[ NO + O_3 \rightarrow NO_2 + O_2 \]

and

\[ 2NO + O_2 \rightarrow 2NO_2 \]

Chronic exposure to Nitrogen Dioxide can cause airway inflammation and therefore respiratory problems, which are far more harmful for people with pre-existing respiratory issues such as asthma or lung diseases.

Furthermore, high concentrations of \(NO_2\) in the air can react with the water in the atmosphere producing Nitric Acid \(HNO_3\) through the reaction:

\[ 3NO_2 + H_2O \rightarrow 2HNO_3 + NO \]

The presence of Nitric Acid in the atmosphere is at the root of phenomena like acid rain, with all the related problems for people and infrastructure.

The goal of this project, inspired by the paper http://www.aaqr.org/files/article/1089/4_AAQR-10-07-OA-0055_128-139.pdf (Han S. et al., Aerosol and Air Quality Research, 11: 128–139, 2011), is to create a fully Bayesian regression model capable of predicting the level of Nitrogen Dioxide \(NO_2\) in an urban area (here, Madrid), given the concentrations of Nitric Oxide \(NO\) and Ozone \(O_3\). The dataset we are going to use is freely available on the Kaggle website at https://www.kaggle.com/decide-soluciones/air-quality-madrid in the form of the .csv files madrid_2018.csv and stations.csv. The file madrid_2018.csv contains hourly averages for several pollutants, measured by 24 different control stations in Madrid, over a timespan of about 5 months (January ~ May) in 2018; however, we are only going to use the data for one day, January 2nd 2018. The file stations.csv contains additional information about the control stations (latitude, longitude, elevation).

main_file = read.csv("madrid_2018.csv")
main_file = main_file[51793:52368,]
station_file = read.csv("stations.csv")

The elevation of the control stations will be used in the Bayesian analysis, so, as a first step, we insert this information into the main file, creating the new dataset madrid_2018_mod.csv (of course, this operation only needs to be performed once):

elevation = c()

for (i in 1:nrow(main_file)){
  elevation[i] = station_file[station_file$id == main_file$station[i],]$elevation
}

main_file$elevation = elevation

write.csv(main_file, file = "madrid_2018_mod.csv")
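As a side note, the row-by-row loop above can be replaced by a single vectorized lookup with match(), which preserves the row order; a minimal sketch on toy data frames standing in for main_file and station_file:

```r
# Toy stand-ins for station_file and main_file
stations = data.frame(id = c(28079004, 28079008), elevation = c(635, 670))
obs = data.frame(station = c(28079008, 28079004, 28079008))

# One vectorized lookup instead of a loop over rows
obs$elevation = stations$elevation[match(obs$station, stations$id)]
obs$elevation  # 670 635 670
```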

Theoretical framework

Chemico-physical model

We have seen how Nitric Oxide \(NO\), Oxygen \(O_2\), and Ozone \(O_3\) can react to form Nitrogen Dioxide \(NO_2\) through the redox reactions:

\[ NO + O_3 \rightarrow NO_2 + O_2 \quad \quad \text{and} \quad \quad 2NO + O_2 \rightarrow 2NO_2 \]

so that, adding the two reactions member by member (the \(O_2\) cancels out), we get:

\[ 3NO + O_3 \rightarrow 3NO_2 \]

We also know from basic chemistry that in every reaction \(\alpha A + \beta B \rightarrow \gamma C\) we have a reaction quotient:

\[ Q = \frac{[C]^{\gamma}}{[A]^{\alpha}[B]^{\beta}} \]

where the \([\quad]\) notation, as usual in chemistry, stands for the concentration of the substance.

At equilibrium, the reaction quotient is equal to the equilibrium constant \(K\).

In our case, when we reach equilibrium, we therefore have the following relation between concentrations:

\[ K = \frac{[NO_2]^3}{[NO]^3[O_3]} \]

and switching to the logarithms we get the equation:

\[ 3\text{log}[NO_2] = 3\text{log}[NO] + \text{log}[O_3] + \text{log}K \]

which is a linear relation among the logarithmic concentrations.

It’s therefore reasonable to expect \(\text{log}[NO_2]\) to be linearly correlated with \(\text{log}[NO]\) and \(\text{log}[O_3]\) in our data.

As a first attempt, we may model the relation among the reactants with the linear equation:

\[ Y = a_1 X_1 + a_2 X_2 + b \]

where \(Y = \text{log}[NO_2]\), \(X_1 = \text{log}[NO]\), and \(X_2 = \text{log}[O_3]\).

We should notice, however, that the concentration of \(O_3\) is actually the result of more complicated cycles.

Indeed, \(O_3\) can be split by interaction with the UV radiation of the Sun through the photolysis process:

\[ O_3 + h \nu \rightarrow O_2 + O \]

The molecular Oxygen \(O_2\) itself can undergo photolysis:

\[ O_2 + h \nu \rightarrow 2O \]

and then recombine with the atomic Oxygen, to form Ozone:

\[ O_2 + O \rightarrow O_3 \]

This means that \(\text{log}[NO_2]\) could actually follow a more complicated dependence on \(\text{log}[O_3]\) than the simple linear relationship we obtained above.

Lastly, it’s worth mentioning that the elevation of the control station may have an influence on the measured concentration of the reactants.

Indeed, the height of the mixing layer could affect the conditions (i.e. temperature, humidity, pressure, turbulence, etc.) in which the chemical and physical processes take place.

Let’s have a look at the data to sort all of this out:

Data collection and exploration

The dataset madrid_2018_mod.csv contains \(576\) observations for the hourly averaged concentrations of several pollutants, measured by different stations, plus the elevation of the corresponding station. We are only interested in the concentration of \(NO\), \(O_3\) and \(NO_2\).

We are going to treat the observations as independent, since chemical concentrations, thermodynamically, are state variables. This means that their value is a function only of the current state of the system (in the case of a gas, concentration is a function of temperature, pressure, volume, and concentrations of other reacting gases), and is independent of the values at previous times (i.e. the “history” of the gas itself).

train = read.csv("madrid_2018_mod.csv")
knitr::kable(align = "c", format = "markdown", train[1:10,c(2,8,9,11,17,18)])
| date | NO | NO_2 | O_3 | station | elevation |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 2018-01-02 00:00:00 | 1 | 11 | NA | 28079004 | 635 |
| 2018-01-02 00:00:00 | 4 | 34 | 46 | 28079008 | 670 |
| 2018-01-02 00:00:00 | 5 | 40 | NA | 28079011 | 708 |
| 2018-01-02 00:00:00 | 3 | 35 | 37 | 28079016 | 693 |
| 2018-01-02 00:00:00 | 1 | 22 | 49 | 28079017 | 604 |
| 2018-01-02 00:00:00 | 1 | 12 | 66 | 28079018 | 630 |
| 2018-01-02 00:00:00 | 1 | 2 | 74 | 28079024 | 642 |
| 2018-01-02 00:00:00 | 1 | 23 | 48 | 28079027 | 621 |
| 2018-01-02 00:00:00 | 2 | 22 | 33 | 28079035 | 659 |
| 2018-01-02 00:00:00 | 2 | 25 | NA | 28079036 | 685 |

The concentrations of \(NO\), \(O_3\), \(NO_2\) are all expressed in \(\mu g/m^3\). The elevation of the measuring station is expressed in meters \(m\).

We are not going to use the data from all 24 measuring stations, mainly because in some of them the \(O_3\) records are completely missing.

In practice, we will use the data from 5 measuring stations, chosen according to the quality of their data and such that they cover, more or less uniformly, the range of possible elevations (which goes from \(\sim 600 \space m\) to \(\sim 700 \space m\)).
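The per-station data quality can be inspected directly; a sketch counting the missing \(O_3\) records for each station (using the column names shown in the table above):

```r
# Number of missing O_3 records for each of the 24 stations
na_counts = tapply(train$O_3, train$station, function(x) sum(is.na(x)))
sort(na_counts, decreasing = TRUE)
```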

Here are the IDs and elevations of the chosen stations:

| Station ID | Elevation (m) |
|:---:|:---:|
| 28079016 | 693 |
| 28079008 | 670 |
| 28079035 | 659 |
| 28079027 | 621 |
| 28079056 | 604 |

Let’s create the corresponding datasets, each one with \(24\) observations:

train1 = train[train$station == 28079016, c(8,9,11,17)]
train2 = train[train$station == 28079008, c(8,9,11,17)] 
train3 = train[train$station == 28079035, c(8,9,11,17)] 
train4 = train[train$station == 28079027, c(8,9,11,17)] 
train5 = train[train$station == 28079056, c(8,9,11,17)]

We will now fill the \(NA\) values using the median of the corresponding column:

fillna = function(data){
  for (i in 1:ncol(data)){
    data[is.na(data[,i]),i] = median(data[,i], na.rm = TRUE)
  }
  return(data)
}

train1 = fillna(train1); 
train2 = fillna(train2); 
train3 = fillna(train3);
train4 = fillna(train4); 
train5 = fillna(train5); 

Then, we will compute the logarithmic concentrations for the pollutants:

train1$logNO = log(train1$NO); train1$logNO_2 = log(train1$NO_2); train1$logO_3 = log(train1$O_3)
train2$logNO = log(train2$NO); train2$logNO_2 = log(train2$NO_2); train2$logO_3 = log(train2$O_3)
train3$logNO = log(train3$NO); train3$logNO_2 = log(train3$NO_2); train3$logO_3 = log(train3$O_3)
train4$logNO = log(train4$NO); train4$logNO_2 = log(train4$NO_2); train4$logO_3 = log(train4$O_3)
train5$logNO = log(train5$NO); train5$logNO_2 = log(train5$NO_2); train5$logO_3 = log(train5$O_3)

train = rbind.data.frame(train1, train2, train3, train4, train5)

We can now show some visualizations. For simplicity, we will only plot the data for the collective dataset (all 5 stations together).

We can see that, as we might have expected, while \(\text{log}[NO_2]\) follows a linear relation with \(\text{log}[NO]\), it doesn’t follow a linear relation with \(\text{log}[O_3]\), since more complicated processes drive the concentration of the Ozone.

However, \(\text{log}[NO_2]\) does seem to show a good linear correlation with \([O_3]\) itself.
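This visual impression can be checked numerically, comparing the sample correlations of \(\text{log}[NO_2]\) with \(\text{log}[O_3]\) and with \([O_3]\) on the combined dataset built above (a quick sketch; the exact values depend on the data):

```r
# Correlation of logNO_2 with the log-concentration vs the raw concentration of O_3
cor(train$logNO_2, train$logO_3)
cor(train$logNO_2, train$O_3)
```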

We can therefore model the relation among the reactants as:

\[ Y = a_1 X_1 + a_2 X_2 + b \]

where \(Y = \text{log}[NO_2]\), \(X_1 = \text{log}[NO]\), and \(X_2 = [O_3]\).

Theoretical framework

Statistical models

Frequentist modeling

Our frequentist modeling will follow two opposite approaches: pooled regression (i.e. a single regression model for all the observations together, regardless of the measuring station) and non-pooled regression (i.e. a separate regression model for each of the measuring stations).

The pooled approach is motivated by the fact that all the observations are, after all, related to the same quantities, so it makes sense to put them together in order to fit our model on the largest possible amount of data.

The non-pooled approach is motivated by the fact that, as we have already noticed, the measuring stations are located at different elevations and use different sensors, so it might make sense to treat them separately.

Pooled regression


In the pooled case we have a single classical regression model:

\[ Y_i = a_{1} X_{i,1} + a_{2} X_{i,2} + b + \epsilon \]

where \(Y = \text{log}[NO_2]\), \(X_1 = \text{log}[NO]\), \(X_2 =[O_3]\), and \(\epsilon \sim N(0,\sigma^2)\).

with the likelihood

\[ L(a_1, a_2, b, \sigma^2) = \frac{1}{(2 \pi \sigma^2)^{n/2}} e^{\displaystyle{-\frac{\sum{[y_i - (a_{1} x_{i,1} + a_{2} x_{i,2} + b)]^2}}{2 \sigma^2}}} \]

that we are going to maximize in order to get the MLE estimates of the parameters.
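Maximizing this Gaussian likelihood is equivalent to minimizing the residual sum of squares, so the MLE coincides with the least-squares fit returned by lm(). A quick sanity check on simulated data (hypothetical variable names), minimizing the negative log-likelihood numerically:

```r
set.seed(1)
n = 100
x1 = rnorm(n); x2 = rnorm(n)
y = 1.5*x1 - 0.5*x2 + 2 + rnorm(n, sd = 0.3)

# Negative log-likelihood of the Gaussian linear model,
# parametrized as (a1, a2, b, log(sigma))
negloglik = function(p) {
  mu = p[1]*x1 + p[2]*x2 + p[3]
  -sum(dnorm(y, mean = mu, sd = exp(p[4]), log = TRUE))
}

mle = optim(c(0, 0, 0, 0), negloglik, method = "BFGS")$par
coef(lm(y ~ x1 + x2))  # (b, a1, a2): matches mle[c(3, 1, 2)] up to tolerance
```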

Non-pooled regression


In the non-pooled case we have a different regression model for each one of the measuring stations, labeled as \(s = 1,2,3,4,5\):

\[ Y_i = a_{s,1} X_{i,1} + a_{s,2} X_{i,2} + b_s + \epsilon \]

So, we are going to find the MLE estimates of the parameters for each one of them separately.

Bayesian modeling

For the Bayesian modeling, on the other hand, we are going to try a pooled approach first, just like in the frequentist case.

Pooled regression

We have, as in the frequentist case:

\[ Y_i = a_{1} X_{i,1} + a_{2} X_{i,2} + b + \epsilon \]

where \(Y = \text{log}[NO_2]\), \(X_1 = \text{log}[NO]\), \(X_2 =[O_3]\), and \(\epsilon \sim N(0,\sigma^2)\).

If we now set:

\[ \vec{a} = \begin{bmatrix} a_{1}\\ a_{2}\\ b\\ \end{bmatrix} \quad \quad \quad \vec{X} = \begin{bmatrix} X_{i,1}\\ X_{i,2}\\ 1\\ \end{bmatrix} \]

we have

\[ Y_i = \vec{a}^T \cdot \vec{X_i} + \epsilon = \mu_i + \epsilon \]

However, in the Bayesian framework, we can now put a prior on \(\vec{a}\) and \(\sigma\). Opting for non-informative priors, we set \(\vec{a} \sim N(\vec{\theta_a}, \Sigma_a)\), with

\[ \vec{\theta_a} = \begin{bmatrix} 0\\ 0\\ 0\\ \end{bmatrix} \quad \quad \quad \quad \tau_a = \Sigma_a^{-1} = \begin{bmatrix} 0.001 & 0 & 0 \\ 0 & 0.001 & 0 \\ 0 & 0 & 0.001 \end{bmatrix} \]

and \(\tau^2 = \frac{1}{\sigma^2} \sim Gamma(\alpha, \beta)\), with \(\alpha = \beta = 0.001\).

Hierarchical model


We are now going to use a hierarchical Bayesian regression model. This approach allows us to take into account both the differences and the similarities among the measuring stations.

For each measuring station \(s = 1,2,3,4,5\) we have a regression model, expressed as:

\[ Y_i = a_{s,1} X_{i,1} + a_{s,2} X_{i,2} + b_s + \epsilon \]

where \(Y = \text{log}[NO_2]\), \(X_1 = \text{log}[NO]\), \(X_2 =[O_3]\), as already said, and \(\epsilon \sim N(0,\sigma^2)\).

If we now set:

\[ \vec{a_s} = \begin{bmatrix} a_{s,1}\\ a_{s,2}\\ b_s\\ \end{bmatrix} \quad \quad \quad \vec{X} = \begin{bmatrix} X_{i,1}\\ X_{i,2}\\ 1\\ \end{bmatrix} \]

we have

\[ Y_i = \vec{a_s}^T \cdot \vec{X_i} + \epsilon = \mu_i + \epsilon \]

So far this looks like the Bayesian version of the non-pooled regression, with a different set of parameters \(\vec{a_s}\) for each of the measuring stations, taking into account the differences among them, but not the similarities.

However, we now put a common prior on all the \(\vec{a_s}\) and a prior on \(\sigma\), setting \(\vec{a} \sim N(\vec{\theta_a}, \Sigma_a)\), and \(\tau^2 = \frac{1}{\sigma^2} \sim Gamma(\alpha, \beta)\), with \(\alpha = \beta = 0.001\).

This means “pulling” all of the \(\vec{a}_s\) towards a common mean, though still allowing them to be different from each other.
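To see concretely what this “pulling” means, consider the simplest case where each \(a_s\) is a single group mean estimated from \(n_s\) observations with noise variance \(\sigma^2\), and the variances are known. Writing \(\hat{a}_s\) for the least-squares estimate of station \(s\), its posterior mean is the familiar precision-weighted compromise

\[ \mathbb{E}[a_s \mid \vec{y}] = \frac{\frac{n_s}{\sigma^2} \hat{a}_s + \tau_a \theta_a}{\frac{n_s}{\sigma^2} + \tau_a} \]

so the fewer (or noisier) the data of a station, the more its coefficient is shrunk towards the common mean \(\theta_a\).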

We have then the (hyper)priors \(\vec{\theta_a} \sim N(\vec{\theta_t}, \Sigma_t)\), and \(\tau_a = \Sigma_a^{-1} \sim Wishart(\phi, \nu)\), with

\[ \vec{\theta_t} = \begin{bmatrix} 0\\ 0\\ 0\\ \end{bmatrix} \quad \quad \quad \quad \tau_t = \Sigma_t^{-1} = \begin{bmatrix} 0.001 & 0 & 0 \\ 0 & 0.001 & 0 \\ 0 & 0 & 0.001 \end{bmatrix} \]

while for the Wishart distribution we have \(\phi = 3\) degrees of freedom and scale matrix:

\[ \nu = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

Hierarchical model (with group-level predictors)


Lastly, we are going to use a hierarchical Bayesian regression model with group-level predictors, which involves two levels of regression. This approach, once again, allows us to take into account both the differences and similarities among the measuring stations, as well as the information carried by their elevations.

The first level is defined this way: for each measuring station \(s = 1,2,3,4,5\) we have an observation-level regression model, expressed as:

\[ Y_i = \vec{a_s}^T \cdot \vec{X_i} + \epsilon = \mu_i + \epsilon \]

Once again, we are going to opt for non-informative priors.

We can put a prior on \(\vec{a_s}\) and \(\sigma\), setting \(\vec{a} \sim N(\vec{\theta_a}, \Sigma_a)\), and \(\tau^2 = \frac{1}{\sigma^2} \sim Gamma(\alpha, \beta)\), with \(\alpha = \beta = 0.001\).

Lastly, we can include in the model additional information, related to the elevation of the measuring stations, which might have an influence on the measured concentrations, as discussed in the chemico-physical modeling section.

The second level is indeed expressed as a group-level regression:

\[ \vec{\theta}_a = \vec{c} \cdot h_s + \vec{d} + \vec{\eta} \]

where \(h_s\) is the height of the measuring station \(s\), and is therefore common to all the observations of the same group for \(s = 1,2,3,4,5\), while \(\vec{\eta} \sim N(\vec{0}, \Sigma_a)\), and lastly \(\vec{c}\) and \(\vec{d}\) are the same for all the groups.

We have then the (hyper)priors \(\vec{c} \sim N(\vec{\theta_c}, \Sigma_c)\), \(\vec{d} \sim N(\vec{\theta_d}, \Sigma_d)\), \(\tau_a = \Sigma_a^{-1} \sim Wishart(\phi, \nu)\), with

\[ \vec{\theta_c} = \begin{bmatrix} 0\\ 0\\ 0\\ \end{bmatrix} \quad \quad \quad \vec{\theta_d} = \begin{bmatrix} 0\\ 0\\ 0\\ \end{bmatrix} \]

and

\[ \tau_c = \Sigma_c^{-1} = \begin{bmatrix} 0.001 & 0 & 0 \\ 0 & 0.001 & 0 \\ 0 & 0 & 0.001 \end{bmatrix} \quad \quad \quad \tau_d = \Sigma_d^{-1} = \begin{bmatrix} 0.001 & 0 & 0 \\ 0 & 0.001 & 0 \\ 0 & 0 & 0.001 \end{bmatrix} \]

while for the Wishart distribution we have \(\phi = 3\) degrees of freedom and scale matrix:

\[ \nu = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

Model comparison

In the Bayesian modeling context, model comparison can be done using the Deviance Information Criterion (DIC), computed as:

\[ DIC = \bar{D} + p_D = D(\bar{\theta}) + 2p_D \]

where \(\bar{D}\) is the posterior mean deviance, \(D(\bar{\theta})\) is the deviance evaluated at the posterior mean of the parameters, and \(p_D = \bar{D} - D(\bar{\theta})\) is the effective number of parameters.

The deviance is defined, up to an additive constant, as:

\[ D(\vec{\theta}) = -2 \space \text{log}(p(\vec{y} | \vec{\theta})) \]

where \(p(\vec{y}|\vec{\theta})\) is the likelihood.

The idea is that the lower the value of the DIC, the better the expected predictive performance of the model.
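Following the standard definitions \(p_D = \bar{D} - D(\bar{\theta})\) and \(DIC = \bar{D} + p_D\), the computation is straightforward once the deviance draws are available; a sketch with toy numbers standing in for the posterior output:

```r
# Toy deviance values, one per posterior draw, and the deviance
# evaluated at the posterior mean of the parameters
dev = c(30.2, 28.9, 31.5, 29.8, 30.6)
dev_at_mean = 28.5

D_bar = mean(dev)            # posterior mean deviance: 30.2
p_D = D_bar - dev_at_mean    # effective number of parameters: 1.7
DIC = D_bar + p_D            # 31.9
```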

Frequentist analysis

Pooled regression


We are going to perform a linear regression in the pooled case:

# Train model
model = lm(logNO_2 ~ logNO + O_3, data = train)

The summary of the model is:

## 
## Call:
## lm(formula = logNO_2 ~ logNO + O_3, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8327 -0.1738  0.0088  0.1688  0.5309 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.597296   0.102094   35.23   <2e-16 ***
## logNO        0.270802   0.024997   10.83   <2e-16 ***
## O_3         -0.018378   0.001544  -11.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2673 on 117 degrees of freedom
## Multiple R-squared:  0.8608, Adjusted R-squared:  0.8584 
## F-statistic: 361.6 on 2 and 117 DF,  p-value: < 2.2e-16

We can now compute the \(R^2\) and \(RMSE\) for the predictions of the concentrations, instead of the log-concentrations:

# Predict

y = predict(model, newdata = train); y = exp(y)
SSE = sum((y - train$NO_2)^2); SST = sum((mean(train$NO_2) - train$NO_2)^2)
R2 = 1 - SSE/SST
RMSE = sqrt(mean((y - train$NO_2)^2))
## R² = 0.839635 ; RMSE = 9.302237
Non-pooled regression


We are now going to do the same in the non-pooled case:

# Linear regression function

linear_regression = function(train){
  
  model = lm(logNO_2 ~ logNO + O_3, data = train)
  
  y = predict(model, newdata = train); y = exp(y)
  SSE = sum((y - train$NO_2)^2); SST = sum((mean(train$NO_2) - train$NO_2)^2)
  R2 = 1 - SSE/SST
  RMSE = sqrt(mean((y - train$NO_2)^2))
  
  return(c(R2, RMSE))
}


# Result

result = rbind(linear_regression(train1), linear_regression(train2),
           linear_regression(train3), linear_regression(train4),
           linear_regression(train5))
## R² = 0.742467, R² = 0.900974, R² = 0.866729, R² = 0.746080, R² = 0.879461
## RMSE = 10.706487, RMSE = 7.082097, RMSE = 3.989067, RMSE = 11.726829, RMSE = 9.764021

Bayesian analysis

Pooled regression


The first model we are going to define is a Bayesian non-hierarchical (pooled) model.

library(rjags, quietly = TRUE, verbose = FALSE, warn.conflicts = FALSE)
library(runjags, quietly = TRUE, verbose = FALSE, warn.conflicts = FALSE)
# Bayesian Model (not hierarchical)

Y = train$logNO_2
X.1 = train$logNO
X.2 = train$O_3
Y_pred = rep(NA, length(X.1))

# The model generates samples from the posterior distribution of the parameters
# as well as from the posterior predictive distribution of the target variable

bm = "
model {
    
  for (i in 1:length(X.1)){
    Y[i] ~ dnorm(mu[i], tau2)
    Y_pred[i] ~ dnorm(mu[i], tau2)
    mu[i] <- a[1]*X.1[i] + a[2]*X.2[i] + a[3]
  }
    
  a[1:3] ~ dmnorm(theta_a[], Tau2_a[,])
  tau2 ~ dgamma(alpha, beta)
}"

# Data

j.data = list(Y = Y,
             X.1 = X.1,
             X.2 = X.2,
             Y_pred = Y_pred,
             alpha = 0.001,
             beta = 0.001,
             theta_a = c(0,0,0),
             Tau2_a = 0.001*diag(3)
             )

# Initial values

j.init = list(
  list(a = c(-3,-3,-3), tau2 = 1.5),
  list(a = c(0,0,0), tau2 = 1),
  list(a = c(3,3,3), tau2 = 1.5)
)
# Execution
fit = run.jags(model = bm, monitor = c("a", "tau2", "Y_pred", "DIC"), data = j.data, inits = j.init, sample = 10000, n.chains = 3, thin = 2, burnin = 1000, separate.chains = TRUE)

Let’s have a look at the summaries for the distributions of the parameters:

## Calculating summary statistics...
## Calculating the Gelman-Rubin statistic for 124 variables....
##             Mean         SD     Lower95     Upper95      psrf
## a[1]  0.27091922 0.02514345  0.22271650  0.32140999 1.0001592
## a[2] -0.01837138 0.00155448 -0.02141011 -0.01526802 1.0002393
## a[3]  3.59670888 0.10289810  3.39457176  3.79906678 1.0002333
## tau2 13.98751780 1.83415067 10.41779863 17.58958462 0.9999995

We can also check the summaries for a few posterior predictive distributions:

##               Mean        SD  Lower95  Upper95      psrf
## Y_pred[1] 3.216221 0.2718369 2.688930 3.757229 0.9999853
## Y_pred[2] 2.902498 0.2732748 2.360442 3.430628 1.0000543
## Y_pred[3] 2.330466 0.2756251 1.780289 2.860190 0.9999797
## Y_pred[4] 2.348380 0.2744431 1.813908 2.885409 1.0000146
## Y_pred[5] 2.290516 0.2731717 1.765501 2.835385 0.9999989

Then, we show the traceplots, density plots, autocorrelation plots, and running means plots:

library(ggmcmc, quietly = TRUE, verbose = FALSE, warn.conflicts = FALSE)

We can see that all of the traceplots show a good mixing among the three chains, and density plots look consistent with each other as well. Moreover, the autocorrelation plots show a very low autocorrelation in the chains, as soon as the number of iterations gets reasonably high, and the running means plots soon reach stability.

Lastly, the \(\hat{R}\) (i.e. the potential scale reduction factor psrf of the Gelman-Rubin statistic), which compares the within-chain and the between-chain variability, is sufficiently close to one for all the parameters.

In other words, there are no signs pointing to a lack of convergence.
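These diagnostics can also be computed numerically with the coda package (a sketch, assuming fit is the runjags object produced above):

```r
library(coda)

chains = as.mcmc.list(fit)                 # the three chains as a coda object
gelman.diag(chains[, c("a[1]", "a[2]", "a[3]", "tau2")])  # psrf (R-hat)
autocorr.diag(chains[, "a[1]"])            # autocorrelation at increasing lags
effectiveSize(chains[, "tau2"])            # effective sample size
```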

Let’s show the \(RMSE\) and \(DIC\)

# RMSE from the posterior predictive means
# ('output' is assumed to be the summary table of the fit, e.g. output = summary(fit);
# rows 1-4 hold the parameters a[1:3] and tau2, the remaining rows hold Y_pred)

bayesian_regression = function(train){
  
  y = as.vector(output[5:nrow(output), "Mean"])
  y = exp(y)
  RMSE = sqrt(mean((y - train$NO_2)^2))
  
  return(RMSE)
}

result = bayesian_regression(train)
## RMSE = 9.295533, DIC = 28.987158

The \(RMSE\) is basically the same as that of the pooled regression in the frequentist case.

Hierarchical model


The next model is a hierarchical model.

# Hierarchical Model

Y = cbind(train1$logNO_2, train2$logNO_2, train3$logNO_2, train4$logNO_2, train5$logNO_2)
X.1 = cbind(train1$logNO, train2$logNO, train3$logNO, train4$logNO, train5$logNO)
X.2 = cbind(train1$O_3, train2$O_3, train3$O_3, train4$O_3, train5$O_3)
Y_pred = rep(NA, nrow(X.1)*5)

hm = "
model{
  
  for (k in 1:5){
    
    for (i in 1:n){
      Y[i,k] ~ dnorm(mu[i,k], tau2)
      Y_pred[i + n*(k-1)] ~ dnorm(mu[i,k], tau2)
      mu[i,k] <- a[1,k]*X.1[i,k] + a[2,k]*X.2[i,k] + a[3,k]
    }
    
    a[1:3,k] ~ dmnorm(theta_a[], Tau2_a[,])
  }
  
  tau2 ~ dgamma(alpha, beta)
  theta_a[1:3] ~ dmnorm(theta_t[], Tau2_t[,])
  Tau2_a[1:3,1:3] ~ dwish(nu[,], psi)
}"

# Data

j.data = list(Y = Y,
             X.1 = X.1,
             X.2 = X.2,
             n = nrow(X.1),
             Y_pred = Y_pred,
             alpha = 0.001,
             beta = 0.001,
             theta_t = c(0,0,0),
             Tau2_t = 0.001*diag(3),
             psi = 3,
             nu = diag(3)
             )

# Initial values

j.init = list(
  list(a = matrix(rep(-1,15), nrow = 3, ncol = 5), tau2 = 1.5, 
       theta_a = c(-3,-3,-3), Tau2_a = 1.5*diag(3)),
  list(a = matrix(rep(0,15), nrow = 3, ncol = 5), tau2 = 1, theta_a = c(0,0,0),
       Tau2_a = diag(3)),
  list(a = matrix(rep(1,15), nrow = 3, ncol = 5), tau2 = 1.5, 
       theta_a = c(3,3,3), Tau2_a = 1.5*diag(3))
)
# Execution
fit = run.jags(model = hm, monitor = c("a", "theta_a", "Tau2_a", "tau2", "Y_pred", "DIC"), data = j.data, inits = j.init, sample = 10000, n.chains = 3, thin = 2, burnin = 1000)

We can now check the summaries for the parameters of the first level:

## Calculating summary statistics...
## Calculating the Gelman-Rubin statistic for 148 variables....
## Note: Unable to calculate the multivariate psrf
##                Mean          SD     Lower95      Upper95     psrf
## a[1,1]  0.374351805 0.077959320  0.22573849  0.531785164 1.000066
## a[2,1] -0.016176451 0.004125452 -0.02422317 -0.008089646 1.000044
## a[3,1]  3.381905264 0.299504527  2.79869951  3.974915573 1.000057
## a[1,2]  0.284639736 0.071597047  0.14438955  0.425441812 1.000076
## a[2,2] -0.009295112 0.004544446 -0.01829461 -0.000477194 1.000028
## a[3,2]  3.396513462 0.343904851  2.70216148  4.048991259 1.000083
## a[1,3]  0.301328784 0.105598831  0.09712448  0.510890810 1.000013
## a[2,3] -0.012042532 0.007071156 -0.02577888  0.002074944 1.000019
## a[3,3]  3.290358457 0.310057932  2.69492915  3.912925522 1.000030
## a[1,4]  0.222090624 0.058492620  0.10390182  0.334305046 1.000079
## a[2,4] -0.024556760 0.003327151 -0.03135557 -0.018291624 1.000030
## a[3,4]  3.971308830 0.224957257  3.53763047  4.419373131 1.000043
## a[1,5]  0.339372032 0.103867576  0.13471343  0.546380402 1.000063
## a[2,5] -0.018924313 0.003819545 -0.02630696 -0.011254463 1.000068
## a[3,5]  3.254346177 0.454948691  2.35084546  4.148290616 1.000060
## tau2   18.929648144 2.589342402 14.06810464 24.127141946 1.000062
as well as the second level:
##                    Mean        SD    Lower95    Upper95      psrf
## theta_a[1]   0.30472012 0.2672410 -0.2379528  0.8288705 1.0001080
## theta_a[2]  -0.01461815 0.2560416 -0.5382660  0.4850298 1.0000217
## theta_a[3]   3.46048323 0.3836077  2.7011214  4.2125135 1.0000242
## Tau2_a[1,1]  6.79564337 3.6366187  0.8834590 14.0496866 0.9999672
## Tau2_a[2,1] -0.01660390 2.6121880 -5.3245671  5.1052800 1.0000134
## Tau2_a[3,1]  0.50197219 2.0786233 -3.5580673  4.8892065 1.0000404
## Tau2_a[1,2] -0.01660390 2.6121880 -5.3245671  5.1052800 1.0000134
## Tau2_a[2,2]  6.99264001 3.7584101  0.9437489 14.2854439 1.0000113
## Tau2_a[3,2]  0.04026703 2.0809928 -4.1703888  4.2817846 1.0000275
## Tau2_a[1,3]  0.50197219 2.0786233 -3.5580673  4.8892065 1.0000404
## Tau2_a[2,3]  0.04026703 2.0809928 -4.1703888  4.2817846 1.0000275
## Tau2_a[3,3]  4.33847936 2.5667993  0.5446739  9.4222400 1.0001343

and a few summaries for the posterior predictive distributions:

##               Mean        SD  Lower95  Upper95      psrf
## Y_pred[1] 3.193193 0.2477794 2.715044 3.686769 1.0000841
## Y_pred[2] 2.919294 0.2395404 2.457042 3.395743 1.0000788
## Y_pred[3] 2.265607 0.2463663 1.791025 2.757417 1.0001097
## Y_pred[4] 2.280021 0.2450006 1.791600 2.751814 0.9999708
## Y_pred[5] 2.232367 0.2480802 1.739510 2.713805 1.0000518

Then let’s have a look at the traceplots, density plots, autocorrelation plots and running means plots:

We observe again a good mixing among the three chains in all the traceplots, and a good consistency among the density plots for different chains. Moreover, the \(\hat{R}\) is clearly close to \(1\) for all the parameters.

Once again, the autocorrelation plots show a low autocorrelation in the chains, and the running means plots exhibit a stable behaviour.

Overall, there are no signs pointing to a lack of convergence.

We can now compute \(RMSE\) and \(DIC\).

# Per-station RMSE from the posterior predictive means
# ('output' is assumed to be the summary table of the fit, e.g. output = summary(fit))

bayesian_regression = function(n, train){
  
  # Take the section of the output containing the right part of Y_pred
  # (the values of Y_pred start at row 29 in the JAGS output, after the
  # 15 a's, 3 theta_a's, 9 Tau2_a's, and tau2)
  y_start = 29 + nrow(train)*(n-1)
  y_end = 29 + nrow(train)*n - 1
  y = as.vector(output[y_start:y_end, "Mean"])
  y = exp(y)
  RMSE = sqrt(mean((y - train$NO_2)^2))
  
  return(RMSE)
}
# Result
result = c(bayesian_regression(1, train1),
           bayesian_regression(2, train2),
           bayesian_regression(3, train3), 
           bayesian_regression(4, train4), 
           bayesian_regression(5, train5),
           as.double(fit$dic[1]))
## RMSE = 10.767396, RMSE = 7.146438, RMSE = 4.056192, RMSE = 11.321131, RMSE = 9.428728
## DIC = 3.445629

The values of the \(RMSE\) are more or less the same as in the non-pooled frequentist case. However, we can see a clear improvement of the \(DIC\) over the previous case (the Bayesian non-hierarchical model).

Hierarchical model (with group-level predictors)


Lastly, we have a hierarchical model with group-level predictors (the elevations of the measuring stations).

# Hierarchical Model (with group-level predictors)

Y = cbind(train1$logNO_2, train2$logNO_2, train3$logNO_2, train4$logNO_2, train5$logNO_2)
X.1 = cbind(train1$logNO, train2$logNO, train3$logNO, train4$logNO, train5$logNO)
X.2 = cbind(train1$O_3, train2$O_3, train3$O_3, train4$O_3, train5$O_3)
Y_pred = rep(NA, nrow(X.1)*5)
h = c(693, 670, 659, 621, 604)

hm = "
model {
  for (k in 1:5){
    
    for (i in 1:n){
      Y[i,k] ~ dnorm(mu[i,k], tau2)
      Y_pred[i + n*(k-1)] ~ dnorm(mu[i,k], tau2)
      mu[i,k] <- a[1,k]*X.1[i,k] + a[2,k]*X.2[i,k] + a[3,k]
    }
    
    a[1:3,k] ~ dmnorm(theta_a[,k], Tau2_a[,])
    theta_a[1:3,k] <- c[]*h[k] + d[]
    
  }
  
  tau2 ~ dgamma(alpha, beta)
  Tau2_a[1:3,1:3] ~ dwish(nu[,], psi)
  
  c[1:3] ~ dmnorm(theta_c[], Tau2_c[,])
  d[1:3] ~ dmnorm(theta_d[], Tau2_d[,])
}"

# Data

j.data = list(Y = Y,
             X.1 = X.1,
             X.2 = X.2,
             n = nrow(X.1),
             Y_pred = Y_pred,
             h = h,
             alpha = 0.001,
             beta = 0.001,
             theta_c = c(0,0,0),
             Tau2_c = 0.001*diag(3),
             theta_d = c(0,0,0),
             Tau2_d = 0.001*diag(3),
             psi = 3,
             nu = diag(3)
             )

# Initial values

j.init = list(
  list(a = matrix(rep(-1,15), nrow = 3, ncol = 5), tau2 = 1.5, c = c(-3,-3,-3), 
       d = c(-3,-3,-3), Tau2_a = 1.5*diag(3)),
  list(a = matrix(rep(0,15), nrow = 3, ncol = 5), tau2 = 1, c = c(0,0,0), 
       d = c(0,0,0), Tau2_a = diag(3)),
  list(a = matrix(rep(1,15), nrow = 3, ncol = 5), tau2 = 1.5, c = c(3,3,3), 
       d = c(3,3,3), Tau2_a = 1.5*diag(3))
)
# Execution

fit = run.jags(model = hm, monitor = c("a", "c", "d", "Tau2_a", "tau2", "Y_pred", "DIC"), data = j.data, inits = j.init, sample = 10000, n.chains = 3, thin = 2, burnin = 1000)
Once again, we show the summaries for the first level parameters:
## Calculating summary statistics...
## Calculating the Gelman-Rubin statistic for 151 variables....
## Note: Unable to calculate the multivariate psrf
##                Mean          SD     Lower95       Upper95     psrf
## a[1,1]  0.389724450 0.083455883  0.22336706  0.5504353058 1.004363
## a[2,1] -0.015399791 0.004418937 -0.02415828 -0.0069201233 1.003497
## a[3,1]  3.321523064 0.323686209  2.68092949  3.9457723279 1.004077
## a[1,2]  0.292084967 0.073472454  0.14206885  0.4319485939 1.001385
## a[2,2] -0.008863408 0.004700294 -0.01797301  0.0005136348 1.001532
## a[3,2]  3.359732887 0.355564613  2.65459753  4.0528155285 1.001686
## a[1,3]  0.306609973 0.106793213  0.09611423  0.5165538105 1.000174
## a[2,3] -0.011692900 0.007141553 -0.02562272  0.0024635906 1.000301
## a[3,3]  3.274542152 0.313629711  2.65005684  3.8857366028 1.000285
## a[1,4]  0.213753038 0.059530829  0.09553135  0.3293280032 1.000820
## a[2,4] -0.025001887 0.003385428 -0.03154033 -0.0183154472 1.001010
## a[3,4]  4.004430733 0.229325377  3.54409227  4.4421269297 1.001071
## a[1,5]  0.315517812 0.125616629  0.07374410  0.5665972389 1.010123
## a[2,5] -0.019726267 0.004498970 -0.02843078 -0.0107916209 1.008924
## a[3,5]  3.360595465 0.553399436  2.26998053  4.4344391230 1.010312
## tau2   18.893892719 2.608295532 13.91974084 24.1041694028 1.000128

the second level parameters

##                      Mean          SD      Lower95     Upper95     psrf
## c[1]         0.0024936805 0.007881413  -0.01419171  0.01736973 1.079203
## c[2]        -0.0003549927 0.007881306  -0.01578510  0.01562336 1.021234
## c[3]        -0.0037831842 0.013699368  -0.02872478  0.02310721 1.059737
## d[1]        -1.3186483261 5.121619159 -10.81806474  9.66977281 1.079019
## d[2]         0.2151021655 5.121166752 -10.05222867 10.36158568 1.021299
## d[3]         5.9204174458 8.916735663 -11.80934960 21.96089737 1.059582
## Tau2_a[1,1]  6.0232765080 3.392655000   0.80632735 12.85955676 1.001695
## Tau2_a[2,1] -0.0322555191 2.452125198  -5.01986816  4.83480680 1.000949
## Tau2_a[3,1]  0.3762040311 2.023564144  -3.74073093  4.52705689 1.001536
## Tau2_a[1,2] -0.0322555191 2.452125198  -5.01986816  4.83480680 1.000949
## Tau2_a[2,2]  6.1457854551 3.494146674   0.60430629 13.04085060 1.000448
## Tau2_a[3,2] -0.0032361395 2.000050278  -4.00204439  4.15035278 1.001166
## Tau2_a[1,3]  0.3762040311 2.023564144  -3.74073093  4.52705689 1.001536
## Tau2_a[2,3] -0.0032361395 2.000050278  -4.00204439  4.15035278 1.001166
## Tau2_a[3,3]  4.0335195612 2.602029154   0.22600946  9.04889876 1.000894

and the target variable:

##               Mean        SD  Lower95  Upper95     psrf
## Y_pred[1] 3.180304 0.2510738 2.690604 3.673477 1.000463
## Y_pred[2] 2.918378 0.2394844 2.433942 3.374634 1.000121
## Y_pred[3] 2.259860 0.2479346 1.770463 2.750372 1.000073
## Y_pred[4] 2.274651 0.2480855 1.789248 2.762216 1.000037
## Y_pred[5] 2.228136 0.2471927 1.733347 2.706908 1.000077

Then, we show the traceplots, density plots, autocorrelation plots and running means plots:

In this case, we can observe clear convergence issues, especially in the plots for the parameters \(\vec{c}\) and \(\vec{d}\). Indeed, the three chains show poor mixing, and the density plots are inconsistent with each other as well.

The values of \(\hat{R}\) are rather far from \(1\), and we notice an excessively high autocorrelation, together with an instability of the running means.

We can now check whether increasing the number of iterations and the thinning can bring the chains to convergence. However, in order to do so, we will limit the monitored parameters to \(\vec{a}\), \(\vec{c}\) and \(\vec{d}\) (otherwise, the system would crash).

# Execution

fit = run.jags(model = hm, monitor = c("a", "c", "d", "DIC"), data = j.data, inits = j.init, sample = 30000, n.chains = 3, thin = 5, burnin = 5000)

The chains now show better mixing, and the density plots are more consistent. However, the autocorrelation is still very high, and the running means rather unstable.

Conclusions

The Bayesian models we have used seem to give reasonably good predictions on the training data. The accuracy, in terms of \(RMSE\), is comparable to that of the frequentist models. The Bayesian model comparison through the \(DIC\) shows a clear improvement of the hierarchical model over the non-hierarchical one.

The hierarchical model with group-level predictors, however, shows a poor consistency among the chains for the parameters of the second-level regression, together with a high autocorrelation and an instability of the running posterior means.

This issue apparently cannot be solved by increasing the number of iterations or the thinning (which bring only slight benefits), and might require a Hamiltonian Monte Carlo sampler, which could be capable of a wider exploration of the parameter space, leading to lower autocorrelation and better overall convergence.
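As a possible follow-up (not run here), the same hierarchical structure could be handed to Stan’s HMC/NUTS sampler through, for instance, the brms interface; a hedged sketch, assuming the combined train data frame built earlier (which keeps the station column):

```r
library(brms)

# Varying intercepts and slopes by station, sampled with HMC/NUTS;
# the default brms priors only loosely mirror those of the JAGS models above
fit_hmc = brm(logNO_2 ~ logNO + O_3 + (logNO + O_3 | station),
              data = train, chains = 3, iter = 4000)
summary(fit_hmc)
```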